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AN ADAPTIVE RIDGE PROCEDURE 
FOR Lq regularization 

By Florian Frommlet and Gregory Nuel 

Penalized selection criteria like AIC or BIG are among the most 
popular methods for variable selection. Their theoretical properties 
have been studied intensively and are well understood, but making 
use of them in case of high-dimensional data is difficult due to the 
non-convex optimization problem induced by Lq penalties. An ele¬ 
gant solution to this problem is provided by the multi-step adaptive 
lasso, where iteratively weighted lasso problems are solved, whose 
weights are updated in such a way that the procedure converges 
towards selection with Lo penalties. In this paper we introduce an 
adaptive ridge procedure (AR) which mimics the adaptive lasso, but 
is based on weighted Ridge problems. After introducing AR its the¬ 
oretical properties are studied in the particular case of orthogonal 
linear regression. For the non-orthogonal case extensive simulations 
are performed to assess the performance of AR. In case of Poisson 
regression and logistic regression it is illustrated how the iterative 
procedure of AR can be combined with iterative maximization pro¬ 
cedures. The paper ends with an efficient implementation of AR in 
the context of least-squares segmentation. 


1. Introduction. Methods for performing variable selection, particu¬ 
larly in a high dimensional setting, have undergone tremendous develop¬ 
ment over the last two decades. Of particular importance in this context 
is penalized maximum likelihood estimation, which can be divided in selec¬ 
tion methods based on generalized information criteria and regularization 
methods [? ]. The former use a penalty which depends on the number of 
estimated parameters, sometimes called Lq penalty, and include the classi¬ 
cal information criteria AIC [? ] and BIG [? ]. Their asymptotic properties 
have been thoroughly studied and are well understood when the number of 
potential regressors is fixed (see for example [? ] and citations given there). 
Specihcally BIG is known to yield a consistent model selection rule, which 
means that as the sample size goes to infinity the probability of selecting the 
true model goes to 1. However, this is no longer true in a high dimensional 
setting, where under sparsity both AIG and BIG tend to select too large 
models [? ]. As a consequence a number of different modifications of BIG 
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have been suggested, for example mBIC [? ? ] which is designed to control 
the family wise error rate (FWER), mBIC2 [? ? ] controlling the false dis¬ 
covery rate, or EBIC [? ] for which consistency under certain asymptotic 
conditions has been shown even when the number of regressors is allowed to 
be larger than the sample size. 

Thus from a theoretical perspective it is rather appealing to perform 
model selection using generalized information criteria. However, the cor¬ 
responding optimization problem is notoriously difficult due to the non¬ 
convexity and discontinuity of the Lq penalty. It is an NP hard problem 
to find the model which minimizes a specific information criterion, and in 
general already for a moderate number of say fifty variables it becomes 
computationally infeasible to guarantee finding the optimal solution. An¬ 
other problem often associated with Lq penalties is the instability of se¬ 
lected solutions [? ]. A possible workaround is to report not only one model 
which minimizes the criterion, but a number of good models which have 
been found for example with some evolutionary algorithms [? ]. In any case 
the approach remains extremely computer intensive and time consuming for 
high-dimensional data sets. 

Regularization methods can serve as an alternative, where penalties are 
not based on the number, but rather on the size of coefficients. A prominent 
example is bridge regression [? ] which uses penalties of the form 
where /?* are the coefficients of the model to be estimated. Special cases are 
ridge regression [? ] for q = 2 and the Lasso [? ] for q = 1, whereas for 
g —)■ 0 the penalty of bridge regression converges towards the Lq penalty 
of generalized information criteria. It has been shown that only for q < 1 
bridge regression can perform variable selection [? ], on the other hand only 
for (7 > 1 its penalty is convex and therefore allows for relatively simple 
optimization algorithms. This partly explains the huge interest that the 
Lasso (q = 1) has received in recent years (see [? ] for a comprehensive 
treatment). 

The Lasso has very nice properties in terms of prediction, but as a model 
selection procedure it is consistent only under rather restrictive assumptions 
[? ? ]. Specifically for strongly correlated regressors it can perform quite 
poorly, and a number of non-convex penalties have been studied to achieve 
sparser solutions [? ]. Eurthermore the coefficient estimates of the Lasso are 
severely biased due to shrinkage. An interesting procedure to overcome these 
deficits is the adaptive Lasso [? ], which makes use of a weighted Li norm 
penalty resulting in a similar convex optimization problem as the original 
Lasso. With suitable choice of the weights the adaptive Lasso was shown 
to have the oracle property, which means that it is both consistent and 
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the nonzero coefficients are estimated as well as when the correct model 
was known. The weights for the adaptive Lasso can be obtained with some 
initial Lasso estimates, and if this procedure is further iterated one obtains 
a multi-step adaptive Lasso [? ? ] 

Already much earlier Grandvalet showed that the Lasso estimate can be 
obtained via some weighted ridge regression [? ? ]. He called his procedure 
adaptive ridge regression, of which a slightly modified version has been re¬ 
cently applied to detect rare variants in genome wide association studies [? 
]. In this article we want to study a different adaptive ridge procedure, which 
was recently proposed [? ] with the aim of approximating Lq penalties. This 
Adaptive Ridge (AR) procedure is somewhat similar to the multi-step adap¬ 
tive Lasso, in the sense that the weights are iteratively adapted; but in each 
iteration weighted ridge regression is performed instead of weighted Lasso, 
which is computationally much easier. 

The iteratively adapted weights of AR are designed in such a way that 
the resulting penalty converges towards the Lq penalty. Therefore the pro¬ 
cedure is somewhat related to the seamless Lo-penalty [? ] and the com¬ 
bination of Lq and Li penalties suggested in [? ], which both represent 
regularized versions of the Lq penalty. However, the latter procedures rely 
upon non-convex optimization, which gets computationally rather difficult 
for large-scale problems as well as for applications beyond linear regression. 
In contrast each iteration of the suggested AR is extremely fast, and we will 
see that the method also performs really well in some non-linear examples. 

The main purpose of this article is to look more systematically into the 
statistical properties of the AR procedure proposed in [? ]. After introducing 
the general procedure in Section 2, we will focus in Section 3 on the special 
case of linear regression. In particular we will provide some theoretical results 
on the behavior of AR under an orthogonal design, and we will show to which 
extent these results apply for more general design matrices. In Section 4 the 
performance of AR will be studied for generalized linear models and for least 
squares segmentation. We finally end with a discussion in Section 5. 

2. General Procedure. 

The Problem. Consider a parametric model with parameter vector (3 G 
in combination with a convex contrast C : —)• M. The most common 

examples of contrasts C{(3) are the residual sum of squares, or minus twice 
the log-likelihood of a given model, but more general functions like pseudo¬ 
likelihood related quantities are conceivable. For all 0 < (? < 2, A > 0 we 
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introduce the penalized contrast 

( 1 ) + • 

Remark 2.1. One can easily replace (3 in the penalty term by any lin¬ 
ear transformation Df3 allowing to consider wider generalizations of penalty 
forms. For example one might consider a subspace extraction such that 
Dp = Pj for a given set J C {1,2 ,..., d}, or a difference matrix such 
that Dp = [Pi — P 2 , P 2 — P 3 , ■ ■ ■, Pd-i — Pd)'^ (where denotes the transpose 
operator). We will make use of this only in the example of Section 4-2. All 
the results obtained previously can he straightforwardly extended for penal¬ 
ties of the form \\DP\W^, but the generalization is omitted for the sake of 
simplicity. 

The objective of this paper is to minimize the penalized contrast of equa¬ 
tion (1) in order to obtain: 

(2) ^ = argnnnC'A,g(/3). 

This relates to Bridge regression for g > 0 [see ? ], with the special cases of 
ridge regression for q = 2, and LASSO for g = 1 [? ]. Note that if q > 1, 
the penalized contrast is both convex and and the problem can be easily 
solved with straightforward convex optimization (Gradient descent, Newton- 
Raphson, etc.). For q = 1, the problem is still convex but with derivative 
singularities that makes the optimization problem more delicate but still 
tractable (coordinate descent [? ], gradient LASSO [? ], etc.). If 0 < g < 1, 
the penalized contrast is not convex anymore and the problem is much more 
challenging [? ]. For the limiting case q = D one obtains for suitable choices 
of A the classical model selection criteria AIC and BIG. Only for very small 
p it is possible to apply exact algorithms which guarantee to find the min¬ 
imal solution [? ], whereas for p > 20 one essentially has to use heuristic 
search strategies like stepwise selection procedures. However, variable selec¬ 
tion based on Lq penalties is believed to be optimal for achieving sparsity 
and unbiasedness, and therefore there is much interest to find efficient algo¬ 
rithms which minimize Cx^q also in case of g = 0. 

The Suggested Solution. Recently Rippe et al. [? ] suggested a method for 
visualizing changes of copy number variation along the chromosome which is 
based on an iterative procedure to minimize residual sum of squares with Lq 
penalties. We will adapt this procedure to our setting of penalized likelihoods 
and discuss it in a slightly more general form. The idea is to obtain 0 
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through an iterative weighted fixed-point procedure. For any A > 0 and any 
non-negative weight vector w G we introduce the function: 

(3) = C{f3) + ^P^dmg{w)f3 = C{f3) + , 

i=i 

where diag(i(;) is the diagonal matrix with weights w on its diagonal. We 
are now ready to introduce our Adaptive Ridge procedure: 

Definition 2.1 (AR). For any A > 0 and 0 < q < 2, the Lg Adaptive 
Ridge sequences and are defined by the initialization = 1, 
and for k G N by: 

(4) = arg mm ^^,^(^-1) iP) 


( 5 ) 


w 


(fc) 






where Equation (5) is defined component-wise, and depends on the constants 
5 > 0 and 7 > 0. 


Equation (4) is just a weighted version of ridge regression, which is usually 
fast to solve. Note that for q = 2 one always has wF) = 1 and thus the 
procedure is not really iterative. In contrast for q < 2, does depend on 
the iteration step k, and in case of convergence of the sequence we will 
write —)• /3. 

The form of the weights wF) of Equation (5) is motivated by the heuris¬ 
tic consideration that at least formally the penalty term of (3) converges 
towards the penalty term of (1), 

( 6 ) 






=1 m'r + 5^) 


(2-q) 

7 


d 

ElA 


Q — 


P 


For q = 1 one obtains in the limit the Lasso penalty by iteratively solv¬ 
ing weighted ridge problems, which has been exactly the motivation of the 
Adaptive Ridge Approach introduced in [? ]. However, the main aim of our 
Adaptive Ridge procedure AR is not to approximate the Lasso, but to focus 
on 0 < g < 1, and especially on the case q = 0. As a consequence our AR 
is more similar in spirit to the multi-step adaptive Lasso discussed in [? ] 
and [? ], where iteratively the weights of the ii penalty are updated using 
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formulas which are very similar to equation (5). More precisely both refer¬ 
ences make use of 7 = 1 , whereas we will later recommend to work with 
7 = 2. Furthermore one finds 5 = 0 in [? ], whereas [? ] introduces <5 > 0 
for numerical stability. Again we will discuss the exact choice of 6 in our 
procedure below. 

The main advantage of our AR approach compared with the multi-step 
adaptive Lasso is that solving a ridge problem in each iteration is much 
easier than solving a lasso problem. While AR works for any < 1 we will 
focus here on the case q = 0, which corresponds to a number of widely 
used variable selection criteria, and for which minimizing ( 1 ) is particularly 
difficult. In fact this optimization problem is NP hard with growing p, and 
thus it is very useful to have a good approximate procedure. 



Fig 1. Approximation 0 / hy the function in dependence of the 

parameter 7 G {1, 2, 3}. The x-axis is at the scale of 5. The four panels illustrate the cases 
q = 1.5,1,0.5,0. 

Numerical considerations. In order to avoid numerical instabilities (mainly 
due to floating point arithmetics), we suggest to use instead of (5) the fol- 
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lowing formula to update the weights Wj : 

f exp S^loglp 

(7) Wj = 


SI 2 exp S^loglp f §- 



|/3j 19-2 exp 

^loglp 

(5 

/3. 

’)] 


if \^j\ < 6 
if \^j\ > 6 


where loglp is the classical function defined by loglp(M) = log(l + u) (for 
all u > — 1 ). 

According to Definition 2.1 the AR procedure depends on two parame¬ 
ters, 5 and 7 . The choice of 5 calibrates which effect sizes are considered as 
relevant. If < 5 the corresponding weight Wj will become large. Eventu¬ 
ally one will obtain in the limit jSj ~ 0, and thus also Wj$‘j ~ 0. On the 
other hand for f3j S> 5 it holds that ^ A choice of 5 = 0 (like 

in [? ]) might then appear to be reasonable, but our numerical experiments 
show that it leads to numerical instabilities and that a small (f > 0 (like in 
[? ? ]) performs noticeably better. Simulation results (not presented in this 
manuscript) suggest that the procedure is not particularly sensitive to the 
exact choice of 5, which coincides with the findings of [? ] in case of adaptive 
lasso. Throughout this paper we will thus work with 5 = 10“®. 

The second parameter 7 determines the quality of the approximation 
~ Figure 1 illustrates for several choices of q the shape of 

depending on the parameter 7 . Clearly for increasing values of 7 the ap¬ 
proximation is getting closer to the desired thresholding step function. In 
simulations not presented here we observed dramatic improvement of the 
performance of AR by raising the parameter from 7 = I.O (like in [? ? ? 
]) to 7 = 2.0 (like in [? ]), while further increasing of 7 did not yield much 
more benefit. 

For the rest of the paper we will focus on the variable selection case q = 0, 
and stick with the choice 6 = 10“^ and 7 = 2 . The Adaptive Ridge Regres¬ 
sion procedure for Lq regularization is therefore defined by the following 
(component-wise defined) weighting scheme: 

( 8 ) . 

Finally it is interesting to point out that AR can easily cope with situa¬ 
tions where solving the weighted ridge problem requires some iterative nu¬ 
merical algorithm for optimization, like gradient descent, Newton-Raphson, 
Marquardt, etc. The idea is simply to update the current value of 
through the iterative numeric procedure rather than computing the exact 
solution to the ridge problem in each step. In other words we propose to mix 
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the iterative schemes of AR and the optimization algorithm. For example, 
the Newton-Raphson version of our procedure can be described as follows: 


Definition 2.2 (Newton-Raphson Adaptive Ridge). For any A > 0 the 
Newton-Raphson Adaptive Ridge sequences and are defined by the 
initialization = 0 and = 1 , and for k £ N by 


(9) 






with weights being updated according to equation (8). 


3. Linear Regression. In this section we will systematically study AR 
with g = 0 as a variable selection procedure in the context of linear regres¬ 
sion. Thus consider the model 


(10) y = Xp + e, 

where y £ M”, X = (Xi,..., Xp) £ and /3 G M^. The error terms are 
assumed to be i.i.d. normal, £i ~ M{0, u^). Furthermore let y be centralized, 
that is Vi — 0) and let all regressors be centralized and standardized 

such that XJ Xj = n. Specifically this means that we consider only models 
without intercept. 

Clearly the log-likelihood of model (10) is of the form 

£(/3,cr^) = const. - nlogfj - ^(X/3 - y)^(X/3 - y) . 

Then —2£ takes the role of the convex contrast C in (1), and in case of 
known error variance ci^ we obtain (after neglecting constants) 

C(/3) = ^iXf3- yf{Xf3 - y) = . 

Variable selection with classical model selection criteria like AIC or BIC 
becomes a special case of (1) with q = 0. More specifically let a model be 
defined by the set of non-zero coefficients M = {j : fij 0}. Then (1) 
becomes 

(11) Ca,o(/3) = 4rSS(/3) + A|M| , 

which for a given model M is clearly minimized at (3]^, the maximum like¬ 
lihood estimate with respect to the given model. 
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We now want to compare variable selection based on (11) with the AR 
procedure defined by (4) and (8). It is straight forward to see that for linear 
regression (4) can be written as an explicit dynamic system, 

(12) = (^X^X + Acj 2 diag(m))”^ X'^y . 

One major result of this section is concerned with shrinkage of coefficients 
resulting from the AR procedure. It turns out that the non-zero coefficients 

of /3 = lim/3 obtained via (12) are smaller in absolute terms than the 

- M 

maximum likelihood estimates (3 of a model M containing exactly the 
same non-zero coefficients as /3. Closely related is the fact that AR with 
parameter A = A does not directly correspond to variable selection based on 
minimizing Cxfi{f3), but that a smaller value of A must be chosen. We will 
first give a theoretical presentation of these results for orthogonal regressors, 
and then illustrate the situation for the general non-orthogonal case based 
on simulation results. 

3.1. Orthogonal case. Assume that p < n and that the design matrix 
fulfills X^X = nip, where Ip is the identity matrix of dimension p. Then 
the usual maximum likelihood estimate of /3 for the saturated model becomes 
^ ^ and (11) evaluated at the maximum likelihood estimate of any 

given model can be rewritten as 

Cxfl = w ( y^y “ ^ X] ) + -^1^1 ■ 

Thus the penalized likelihood is minimized when all those regressors enter 
the model for which 

(13) /3| > Acj^/n , 

which results in the well known fact that under orthogonality the model 
selection approach defined by (11) is nothing else but a thresholding pro¬ 
cedure for the individual coefficients. Note that the whole argument relies 
upon the fact that in case of orthogonality the coefficients are estimated 
independently from each other. 

We next argue that AR also becomes a simple thresholding procedure 
under orthogonality. First note that (12) can be rewritten as 

( 14 ) = ^ = 
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X 


Fig 2. Function which determined the dynamic system (16) for 5 = 10 ® and /3j = 0.9. 


where we define K = \a^ jn. Thus we have for each coefficient a one¬ 
dimensional dynamic system independent of the other coordinates, which 
is easy to solve. Equation (14) already indicates the shrinkage of the limit 
/3j compared with the ML estimate /3j. The stationary points of the sequence 
^ can be found by solving the equation 


(15) 


A' 1 + 


K 


6^ + Pl 




'{k) 

For the sake of notational convenience let’s write xy = fij ■ We thus 
study the dynamic system 


(16) 


Xk = 




f{xk-l)' 


with /(x) = 1-k + t"^) 


2\-l 


which is illustrated in Figure 2. As long as K > 85^ (which is essentially 
always the case) the function xf{x) has two positive local extrema. The 
dynamic system (16) has only one stationary point xy when the function 
value of its positive local minimum x* is larger than 13j, that is 

(17) x*/(x*) > Pj, with x^ = y - -k ^\/(iL - 2 ^ 2)2 _ 4(^2 ^ 


If (5 <C 77 this roughly means that (3j < 2y/K. In that case it is easy to see 
that the only stationary point x/ is attractive, and one has x^ —)• x/ ~ 0 
(see Figure 2 for 77 = 0.3). 
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The other common situation occurs when the inequality in (17) changes, 
that is when essentially /3j > 2\^. Then it holds that (15) has three solu¬ 
tions XI < xji < XIII. Standard arguments from the theory of dynamical 
systems show that xi and xm are attractive, that is for xi < xn one has 
Xk —>■ xi, otherwise if xi > xn then Xk xm (see Figure 2 for K = 0.1 
and K = 0.05). Note that xi, xn and xm are the roots of a polynomial of 
third degree for which explicit formulas are available. 

In the exceptional case where there are only two stationary points the 
dynamic is such that for xi < xn one has again —)> xj, but for xi > x// 
one has Xk —)• x//. Thus xn is a saddle point (see Figure 2 for iF = 0.2). 

Convergence of Xk —)• x/ can be interpreted as (3j = 0, although 0 < 
Xi ~ 6‘^j3j/K. However, numerically this is small enough to be indistin¬ 
guishable from zero as long as 5 is sufficiently small. Thus from a model 
selection perspective convergence towards x/ indicates that a coefficient has 
been excluded, whereas the limit xm corresponds to regressors which have 
been included in the model. Furthermore equation (15) shows the amount 
of shrinkage that a regression coefficient suffers from AR. The larger xm 
and the smaller K, the less shrinkage. 

Two conditions have to be fulfilled that a regressor is selected by AR. 
Firstly the dynamical system of the component must have three fixed points, 
which corresponds to the condition that K < /S|/4. Secondly it is then 

necessary that xi > x/j. Remember that AR computes in its first step '0' ^ 
by standard ridge regression, and therefore xi = /3j/(l -|- K). On the other 
hand a very good approximation of x// can be obtained by letting d = 0 in 
(16) and then solving the corresponding stationary equation, which results 
in 

XII ~ /3i/2 - . 

As long as AT < 1 it then always holds that xi > x//, and it follows that the 
dynamic of AR under orthogonality is completely determined by the number 
of fixed points for each regressor. To summarize, under orthogonality AR 
becomes a thresholding procedure where a regressor is selected in case of 

(18) /3| > 8A(T^/n . 

Comparing conditions (13) and (18) then yields 

Proposition 3.1. Under orthogonality performing AR with A corre¬ 
sponds to minimizing (11) with A = 4A. 

Remark: The result holds under the condition that K < 1, or equivalently 
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that A < n/a^. In practice this seems to be no huge restriction. To give some 
examples, the penalties of AIC, BIC and mBIC are A = 2,A = logn and 
A = log(np^/4), respectively. As long as y is reasonably scaled the condition 
K <1 will always apply. 


3.2. Non-orthogonal case. For the non-orthogonal case a full analysis 
of the dynamical system (12) becomes way more complicated, because it 
cannot be reduced any longer to independent analysis for the individual 
coefficients. Instead of attempting to obtain analytic results we will focus 
here on illustrating the most important features of AR by presenting results 
from simulations. Before that we only want to mention that as a simple 
consequence of (12) it always holds that 


(19) 


P 


{k) 


< 


ix^x) 




~ (k) 

and thus the sequence of /3 remains bounded. However, it turns out that 

~ (k) 

the mapping underlying the dynamic system /3 is usually not a contrac¬ 
tion, and therefore theoretical convergence results are rather hard to obtain. 
In fact changing the initial value of the weights can have some effect on 
the limit of /3 , though usually the obtained solutions are not too different 

from each other. 

Initial value = 1 Disturbed initial value 




Iteration k Iteration k 

Fig 3. Convergence of procedure for one simulated instance where the standard initial 
value is compared with a different choice of 


Figure 3 provides a typical example that illustrates the behavior of AR for 
the general linear case. We simulated one instance according to (10) with 
p = n = 100, where the correct model had k* = 24 regressors. The first 
plot uses our standard initial value = 1 for all components, whereas 
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in the second plot the components of the initial value are randomly chosen 
between 1/2 and 3/2. The models resulting from the two starting points 
differ only in one regressor, where a true positive detected by the second 
model is substituted in the first model by a false positive. Otherwise both 
models contain the same non-zero coefficients, for which estimates can also 
slightly differ. For this instance trying further random initial values of ~ 
U (0.5,1.5) provided a third limiting model which added one false positive to 
the second model. Interestingly the model obtained with the second starting 
point which was doing best in terms of misclassification had the largest BIG 
criterion (141.03), while the other two models had almost identical BIG 
criterion (140.06 and 140.07). In general our experience with simulations 
shows that although the limit of the AR procedure depends on the starting 
point, the different solutions obtained will have very similar values of the 
selection criterion that one attempts to approximate. In fact the instability of 
solutions does not come as a surprise bearing in mind that variable selection 
based on information criteria is well known to suffer from instabilities with 

respect to small changes within the data [? ]. 

~ (fc) 

Note that any component of the sequence (5 which once has approached 

~ (^) 

zero also remains there. This can be easily understood because for small (3j 
the corresponding weight wj becomes very large, and the matrix X + 
Acr^ diag(io) becomes essentially orthogonal with respect to the j-th compo¬ 
nent. This mechanism of the procedure can be used to force some to stay 
in the model regardless of the penalty, simply by setting its corresponding 
initial weight wj to 0. This might be useful in practice if one would not like 
to perform model selection on a certain subset of regressors. The majority 
of coordinates converging to zero does so within less than 10 iterations, but 
there are some exceptions for which convergence to zero takes substantially 
longer. The instability of the AR model as a function of the initial values ap¬ 
pears to depend mainly on the behavior within the first few iterations, where 
for the majority of coefficients it becomes clear whether they are selected or 
not. 

3.2.1. Correlated regressors. Next we look at two very simple scenarios, 
where we study more systematically the behavior of AR when regressors are 
correlated. To this end we consider correlation structures from compound 
symmetry and auto regressive models. In both scenarios the parameter p 
varies between 0 and 0.8, where p specifies pairwise correlation between 
neighboring covariates for auto regression (Scenario 2), and pairwise cor¬ 
relation between all regressors for compound symmetry (Scenario 1). We 
consider with p = 15 a relatively small number of regressors. This allows 
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for a systematic examination of the performance of AR compared with all 
subset selection, which is for p = 15 still conveniently possible. For each sce¬ 
nario we simulate 500 traits for n = 50 individuals based on linear models 
with 5 regressors having nonzero coefficients. The effects are all chosen to 
be /3j = 0.5, which equals half of the predefined standard deviation <7 = 1. 
Regressors entering the model were chosen to be j G {1,..., 5} for Scenario 
1, and j G {2,5,8,11,14} for the second scenario. Selection based on BIC 
is compared with AR using parameter A = log(n)/4, that is we use the 
relationship A = 4A as suggested by Proposition 3.1. 


Gompound Symmetry Auto regression 



Correlation Correlation 


Fig 4. Difference of BIC between model obtained with AR and best model. The numbers 
below the boxplots give the relative frequency of simulation runs in which AR gave the 
optimal solution. 

Figure 3.2.1 illustrates to which extent AR yields the optimal model ac¬ 
cording to BIC. For small correlations AR gives in the majority of cases the 
same model as all subset selection, which starts to change only for p > 0.3. 
In Scenario 2 AR yields more often the optimal model than in Scenario 1, 
when comparing results at the same level of pairwise correlation. Clearly a 
compound symmetry model provides in general more correlation between 
regressors than an autoregressive model, and we might conclude that AR 
differs increasingly from all subset selection the farther away one gets from 
orthogonality. 

Interestingly from a statistical point of view AR seems to perform almost 
better than all subset selection based on BIC. For the majority of cases AR 
has less misclassifications than all subset selection (see Table 1). Specifically 
for Scenario 1 AR tends to have larger power to detect the correct regressors, 
while controlling the Type I error at a similar rate like BIC. On the other 
hand in Scenario 2 AR tends to give less Type I errors, while having similar 
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power to BIC. In summary one might conclude that for p < n (at least in 
these two scenarios) the choice of A = 4A from Proposition 3.1 worked quite 
well even in the non-orthogonal case. 

3.2.2. High-dimensional setting. A large number of recent statistical ap¬ 
plications are confronted with the challenging task of model selection when 
p > n. Here we perform simulations under the assumption that regressors 
are independent normally distributed variables. Sample size was fixed with 
n = 100, while for the growing number of potential regressors we considered 
p G {100, 250, 500,1000, 2500, 5000,10000}. For each setting 1000 models of 
size k* = 24 were simulated from (10), with normally distributed random 
effect sizes jdj ~ N{0,0.5), j G {1,..., 24}, and again an error standard de¬ 
viation of fj = 1. Keeping k* fixed gives with growing p an increasingly 
sparse situation. Hence the model selection criterion mBIC is more appro¬ 
priate than BIC (see [? ]), but here we are mainly interested in studying the 
properties of AR and will therefore show results for both criteria. 


Table 1 

Comparison of the performance of all subset selection (BIC) with AR in terms of power, 
number of false positives (FP), false discovery rate (FDR) and number of 
mtsclassifications (Mis). For Scenario 1 the correlation (Corr) refers to pairwise 
correlation between all regressors, for Scenario 2 only for neighboring regressors. 


Power 


FP 


FDR 


Mis 


Corr 

BIC 

AR 

BIC 

AR 

BIC 

AR 

BIC 

AR 

Scenario 1: 

0.0 

0.85 

0.83 

0.54 

0.52 

0.11 

0.10 

1.30 

1.39 

0.1 

0.81 

0.84 

0.61 

0.62 

0.11 

0.11 

1.54 

1.42 

0.2 

0.82 

0.86 

0.56 

0.51 

0.10 

0.09 

1.47 

1.23 

0.3 

0.79 

0.83 

0.65 

0.66 

0.13 

0.12 

1.71 

1.50 

0.4 

0.75 

0.80 

0.61 

0.62 

0.12 

0.12 

1.86 

1.60 

0.5 

0.68 

0.74 

0.79 

0.73 

0.17 

0.15 

2.38 

2.02 

0.6 

0.66 

0.72 

0.61 

0.65 

0.14 

0.14 

2.30 

2.04 

0.7 

0.56 

0.62 

0.71 

0.76 

0.19 

0.18 

2.90 

2.68 

0.8 

0.49 

0.54 

0.84 

0.87 

0.25 

0.24 

3.41 

3.19 

Scenario 2: 

0.0 

0.85 

0.83 

0.54 

0.47 

0.10 

0.10 

1.29 

1.35 

0.1 

0.86 

0.88 

0.55 

0.53 

0.10 

0.09 

1.24 

1.11 

0.2 

0.81 

0.80 

0.62 

0.54 

0.13 

0.11 

1.58 

1.52 

0.3 

0.67 

0.62 

0.70 

0.65 

0.19 

0.18 

2.37 

2.56 

0.4 

0.88 

0.88 

0.74 

0.72 

0.13 

0.13 

1.36 

1.32 

0.5 

0.81 

0.83 

0.84 

0.79 

0.17 

0.15 

1.81 

1.62 

0.6 

0.80 

0.80 

0.91 

0.93 

0.18 

0.18 

1.93 

1.94 

0.7 

0.72 

0.75 

1.00 

0.89 

0.21 

0.18 

2.41 

2.13 

0.8 

0.61 

0.65 

1.30 

1.24 

0.30 

0.28 

3.27 

3.00 
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Power 


Average number of FP 



Number of Misclassifications 





Fig 5. Comparison of model selection performance based on some stepwise selection pro¬ 
cedure for BIC and mBIC with the corresponding AR procedures. The four panels show the 
average over 1000 simulation runs of power, number of false positives, number of misclas¬ 
sifications and false discovery rate as a function of the total number of potential regressors 
p. Data were simulated under a model with fe = 24 regressors. 


We want to compare the performance of variable selection using simple 
stepwise search strategies for the two information criteria BIC and mBIC 
with their respective AR procedures. Our stepwise procedure is fairly simple. 
It starts with a model including the best 40 regressors according to marginal 
test statistics. Then greedy backward elimination is performed all the way 
down to a model of size one. That model along the way which minimizes the 
criterion in question is then considered as the starting point for some final 
greedy forward selection which is performed till no more improvement of the 
criterion is obtained. For the AR procedure we use again the relationship 
A = 4A from Proposition 3.1. Before applying AR the top 100 regressors 
were preselected based on marginal tests, which noticeably improved the 
performance of AR. 

We start with discussing Figure 5, which compares classification charac- 
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teristics of the four procedures. Only for n = p = 100 BIG and mBIC are 
comparable in terms of misclassification. With growing p BIG produces ex¬ 
ceedingly more false positives than mBIG, which cannot be compensated by 
the relatively mild gain in power. Both for BIG and mBIG the AR procedure 
is more conservative than the corresponding stepwise selection procedure, 
which means that it is less powerful, but produces also less false positives. 
Interestingly for both criteria AR produces less misclassifications than step¬ 
wise selection. 



Fig 6 . Boxplots of differences between values of selection criteria for models obtained with 
stepwise search strategy and with AR. The first panel shows results for BIC, the second 
panel for mBIC. Results are based on the same data as Figure 5. 

Looking again at the differences of criteria for models obtained with step¬ 
wise selection and with AR, one can see that for p getting larger AR tends 
to give models with larger values of the criterion than stepwise selection. 
However, even for the largest p there are at least some instances where AR 
gives better models according to each criterion than stepwise selection. For 
p = n = 100 AR and stepwise selection perform more or less identical, where 
the median of differences is almost exactly at 0. In case of BIG the median of 
differences increases with p till p = 1000 and then remains constant, whereas 
for mBIG the median of differences continues to grow also for larger values 
of p. It is interesting to observe that although for p > n AR does usually 
not manage to hnd those models that minimize the information criterion, it 
outperforms the corresponding stepwise selection procedure with respect to 
misclassification. 

The fact that the AR procedure is for p > n more conservative than 
stepwise selection gives rise to the question whether the relationship A = 4A 
from Proposition 3.1 is still correct, or whether one would rather have to use 
in that situation more relaxed penalties to compensate for shrinkage. Our 
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simulation results did not provide a definite answer to this question, but we 
will see in the next section on generalized linear models that in principal it 
is easy to obtain solutions of AR for a whole range of A values, among which 
one can then choose the model which minimizes the original Lq penalty with 
parameter A. 

4. Further Applications. In this section, we consider two more ap¬ 
plications of the adaptive ridge approach in order to illustrate its usefulness 
beyond linear regression. We first discuss in Section 4.1 two particular cases 
of generalized linear models, Poisson regression and logistic regression. As 
the weighted ridge problem associated with these two models has no closed- 
form solution, we rely upon the Newton-Raphson adaptive ridge version (9) 
of our algorithm to solve the corresponding optimization problems. After¬ 
wards we reconsider in Section 4.2 the least squares segmentation problem 
for which AR was first introduced in [? ], but we improve on the original 
publication by deriving explicit recursive formulas for solving the weighted 
ridge problem rather than relying on (sparse) LU decompositions. As a re¬ 
sult, our approach is much faster than the original one. 

4.1. Generalized linear model. 

4.1.1. Poisson regression. To illustrate how to apply AR in the context 

of generalized linear models we will discuss Poisson regression and logistic 
regression models. We start with the classical Poisson regression problem 
Hi ~ V{ni{P)) where = exp(Xj/l) with y,n & M”, X G and 

(3 G M^. In order to maximize the Lq penalized log-likelihood of the problem, 
we introduce for any penalty A > 0 and weight vector m G the following 
weighted ridge penalized log-likelihood: 

(20) £(/3; A, w) = const. -|- (3^X^y — diag(in)/3 , 

where u G M” is an all-one column-vector. For given A > 0 we want to 
maximize this quantity using the Newton-Raphson algorithm. Simple com¬ 
putations give the first two derivatives of £(/3; X,w), 

(21) V£(/3; A, w) = X^y — X'^y,{(3) — A diag(m)/3; 

(22) Hess£(/3; A, m) = —diag(/.i(/3))X — Adiag(m). 

Maximizing Eq. (20) can therefore be done iteratively using the following 
update for /3: 

(23) P P — Hess.^(/3; A, m)“^V£(/3; A, w). 
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The AR procedure also requires to update the weights w according to (8). 
Combining both updates leads to the computationally efficient procedure 
whose R-code reads as follows: 


w=rep(l.0,p); beta=rep(0,p); 
for (iter in 1:itermax) { 
niu=exp(X°/o*'/obeta) [,1] 

A=t (X)7o*yodiag(mu)yo*7oX+lambda+diag(w) 
b=t (X) y,*"/, (y-mu) -lambda*w*beta 
beta=beta+solve(A,b)[,1] 
w=l.0/(beta*2+delta~2) 


We want to point our that the resulting code is quite compact and extremely 
easy to understand and to implement, which is in stark contrast to the 
available LASSO implementation of the same problem [? ] which uses a 
rather delicate coordinate descent algorithm. 

Like in case of LASSO it is also possible for AR to take advantage of a 
warm start of the algorithm to obtain the full regularization path of the 
problem (see Figure 7). For that purpose, we start with a near null penalty 
A, and then increase the value of the penalty using for each new penalty the 
previously computed weight vector w and parameter [3 as starting points. 
Obtaining the full regularization path is of particular importance in case of 
GLM because we do not have any theoretical results like Proposition 3.1 
telling us which A of AR corresponds to the A of a given selection criterion. 
Interestingly, in case of Poisson regression and the BIC criterion it turns out 
that like in case of linear regression the factor 4 works really well. 

In Figure 8 we compare AR Poisson regression with A = log(n)/4 to 
the standard stepwise selection procedure based on BIC. Two simulation 
scenarios are considered, the first one with p = 50, the second one with 
p = 500, where both scenarios use a sample size of n = 300. Count data 
were simulated from Poisson regression models of size A: = 10 and k = 
25, respectively. The covariates Ajj- and the non-zero coefficients were 
independently drawn from normal random variables according to Xi^j ~ 
W(0,0.1^) and ~AA(0,1.5^). 

For the first scenario with p < n AR and stepwise selection give almost 
identical results, which is quantified by the extremely small mean squared 
error (MSE) of the difference between obtained BIC values. This illustrates 
on the one hand that the Newton-Raphson AR procedure (9) works really 
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^4 ^2 0 2 4 

log(lambcia) 


Fig 7. Example of a full regularization path for Lq adaptive Ridge Poisson regression with 
n = 300, p = 50, and (3* = 0 except for the first k = 10 coordinates. The covariates 
Xij and non-zero coefficients (3j are independently drawn from normal random variables 
according to Xij ~ A/'(0,0.1^) and fij ~ A/'(0, 1.5^). 




(a) (b) 


Fig 8 . Comparison of BIC criteria obtained with stepwise search and with AR for Poisson 
regression. Panel (a): n = 300, p = 50, k = 10; Panel (b): n = 300, p = 500, k — 25, 
where n is the sample size, p the total number of regressors and k the size of the data 
generating model. 


well, and on the other hand that A = log(n)/4 is a perfect choice in this 
setting. In the high-dimensional setting with p = 500 both methods still 
give very similar results, but with more distinct differences. Note however 
that differences go in both directions, and there is no clear trend observable 
that stepwise procedures would give better results than AR. 
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MSE=19.65 




(a) (b) 


Fig 9. Comparison of BIC criteria obtained with stepwise search and with AR for logistic 
regression with n = 300, p — 50, and k = 10. In Panel (a) AR is performed with A = 
log(n)/4, in Panel (b) with X = log(n)/5. 


4.1.2. Logistic regression. The classical binary logistic regression model 
is of the form m ~ B{'i:i{(3)) where 7rj(/3) = 1/(1 + exp(—Xj/3)) with y G 
{0,1}”, TT G [0, !]”■, X G and (3 G MP. Just like in the case of Poisson 

regression we introduce for any penalty A > 0 and weight vector w cMP the 
weighted Ridge penalized log-likelihood: 

(24) 

£(f3;X,w) = {{I - y)log{l - 7 v{( 3)) + ylog{7T{P))} - ^A/3^ diag(m)/3 , 

where u G M”' is an all-one column-vector. The Newton-Raphson AR (9) 
needs again the two first derivatives of that penalized likelihood function, 
which are 

(25) V£(/3; A, w) = X^{{1 - 7r(/3))y - (1 - y)Tv{f3)) - A diag(m)/3; 

(26) Hess£(/3; X,w) = — diag(7r(/3)(l — 7v{(3))X — Adiag(m). 

Like for Poisson regression we compared AR logistic regression to a stan¬ 
dard stepwise selection procedure based on BIC. We present simulation re¬ 
sults for a scenario which is similar to the first scenario for Poisson regres¬ 
sion, with p = 50, n = 300 and k = 10. Again both the covariates Xij and 
the non-zero coefficients f3j were independently drawn from normal random 
variables according to Wj ~ AA(0,0.1^) and ffj ~ AA(0, 3.5^). 

Figure 9 illustrates that for logistic regression the relationship A = log(n)/4 
no longer gives the best results, but that the slightly smaller penalty of 
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A = log(n)/5 performs better for this scenario. No more improvement of 
MSE was observed by further decreasing the penalty A (data not shown). 
For A = log(n)/5 AR gives quite similar results to stepwise selection, but 
the agreement is not as strong as in the corresponding scenario of Poisson 
regression, and if there are differences then in the majority of cases AR tends 
to give larger values of BIC than stepwise search. 

In summary we can conclude that the result of Proposition 3.1 does not 
always hold for generalized linear models, but that searching over a range 
of values of A and considering that model which minimizes BIC along the 
regularization path provides a simple and efficient strategy to overcome that 
problem. 

4.2. Least squares segmentation. We finally want to discuss least squares 
segmentation of a one-dimensional signal, which was recently applied in the 
context of analyzing pathological patterns of DNA in tumor tissues [? ]. 
Let y G M" denote n measurements which are spatially (or temporally) or¬ 
dered. Then the problem of segmentation can be formalized by introducing 
Lq penalties for changing the estimated mean between neighboring measure¬ 
ments, 

{ n n—1 ^ 

y'(2/i - + A V1 (/Tj / yi+i) ) , 

^ ^ J 

where !(•) G {0,1} is the indicator function. According to Remark 2.1 this 
fits into our context as a slightly generalized version of the penalized contrast 
(1), and like in [? ] we introduce the following weighted Ridge square loss 
as a generalization of (3): 


n n—1 

(28) SL(/x; A, w) = '^{yi - + A ^ Wi{yi+i - ynf . 

i=l i=l 

For the corresponding AR procedure we again start with the initial weights 
~ 1 and for A; > 1 perform the iterations 

(29) 

/xW = arg min SL {^y-= ^<5^ -h ^ 

The computations of (29) can be easily solved analytically by considering 
the derivatives of SL(/r; A, w). Minimization of the loss function then corre- 
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spends to solving the following set of linear equations: 


(30) 


{yi - yi) + Xwi{fi 2 - fii) = 0 

(2/2 - Ai2) + Xw 2 {fJ .3 - fJ-2) - Xwi{fl2 - m) =0 

iVn fJ'n) XlVn—l^yn l) — 0 


In [? ] it was suggested to solve this problem using an efficient sparse LU 
decomposition. Here we provide a dramatically faster alternative which al¬ 
lows to recursively compute the solution. For i = 1,..., n — 1, let us write 
m = ai + bi^i+i where ai,bi G M. From the linear equations above we obtain 

i = 1; 

1 < A < n , 

with Di = 1 + Xwi + — 6i-i), and finally 


(31) 


oi = 


Qji — 


yi 


1 -|- Xwi 
Vi + Xwi-iaj-i 
Di 


bi = 
bi = 


Xwi 


1 -|- Xwi 
Xwj 
Di 


(32) 


Un “ 1 “ XWn—ldn—l 

l^n — Z I r TZ 7 r t — n , 
1 -F XWn-l{l - bn-l) 

yi = ai + biHi+i i <n . 


Using these recursive formulas one can hence perform one update step of 
(29) in 0{n). Alternatively, one can use dynamic programming to find the 
best solution of (27) with at most femax > 1 segments in 0(/cmax x v?). 
Such a strategy is for example explained in ? ] and implemented in the 
SegmentorSIsBack R package [? ]. 

In order to validate the adaptive Ridge approach in the context of least 
squares segmentation we will compare its performance with the exact ap¬ 
proach in a small simulation study. We consider a simple Gaussian design 
with n = 500 consecutive measurements and three breakpoints at positions 
100, 250 and 375. Based on a Gaussian model 200 data sets were gener¬ 
ated with mean values —0.3,0.7,1.5, 0.5 in the four different segments, and 
a common standard deviation of = 1.0. After performing some calibra¬ 
tion of the parameter A using the previously discussed warm start method 
of AR (Figure 10a) we decided upon using the AR penalty A = A/6, where 
A = 21og(n) is the penalty of the original criterion. This rescaling factor 
appeared to be quite stable for various scenarios, though perhaps increasing 
slightly with growing n (data not shown). 

We can see in Figure 10b a comparison of the SE penalized criterion 
obtained both by exact computations and the AR method. AR clearly gives 
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(c) (d) 

Fig 10. Comparison of exact segmentation (X = 21ogn^ and adaptive Ridge segmentation 
(X = Xf scale). Panel (a) show the calibration of the rescaling parameter which leads to 
scale = 6. Panel (b) compares the exact penalized SE to the one obtained through AR 
with X = 21og(n)/6. Panels (c) and (d) illustrate the segmentation output for two specific 
instances where one observes some disagreement between exact (red solid line) and AR 
segmentation (blue dashed line). 


good results, although sometimes suboptimal. In Figure 10c, d we give two 
examples of such suboptimal situations: in Figure 10c, AR has two misplaced 
breakpoints and the selection of an additional one. In Figure lOd AR missed 
one very small segment that was considered relevant by the exact approach. 
Thus although AR did not find the optimal model in terms of the criterion, 
its solution is in fact closer to the underlying true model. Given the general 
good performance of AR one might conclude that due to its efficiency it 
might be preferable to looking for the exact solution particularly for large 
scale problems. 
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5. Discussion. In this paper we have introduced the adaptive ridge 
procedure AR, an iterative procedure whose purpose is to solve Lq penalized 
problems via weighted ridge optimization. The approach, recently suggested 
by [? ] in the particular context of least squares segmentation, is very simi¬ 
lar to the iterative adaptive Lasso procedure introduced in [? ? ], with the 
noticeable difference that AR requires at each iteration to solve a weighted 
ridge problem instead of the weighted Lasso. As a result, the practical im¬ 
plementation of the adaptive ridge is often dramatically simpler than its 
adaptive lasso counterparts, and it is computationally much less expensive. 
This is illustrated particulary in Section 4, where we provide a simple solu¬ 
tion to the three classical problems of Poisson regression, logistic regression, 
and least squares segmentation. 

It was pointed out in [? ] that the adaptive ridge approach clearly performs 
very well in practice, though any theoretical justifications of that behavior 
was missing. In this paper we partially addressed this problem by studying 
the dynamics of AR in the particular case of orthogonal linear regression 
(with known variance). In this context we derived explicit conditions for 
the convergence of AR and proved that the adaptive ridge penalty needs to 
be four times smaller than the original Lq penalty to give the same results. 
According to our simulations this scaling factor of 1/4 worked quite well also 
in case of non-orthogonal linear regression, as long as the correlation between 
covariates was not too high. In case of highly correlated regressors, as well 
as for p ^ n, further investigation might be necessary, but in general such 
rescaling offers a natural way to select adaptive ridge penalties by targeting 
classical Lq penalty schemes like AIC and BIC, or in a high-dimensional 
setting the more recently suggested mBIC. 

Furthermore the AR procedure, just like the lasso, allows to take advan¬ 
tage of warm starts to compute efficiently the entire solution surface for a 
sequence of penalties. This gives the possibility to select the most appro¬ 
priate penalty of AR without any need to know the rescaling scheme. Note 
that for the adaptive ridge we have to consider increasing penalty values, 
whereas for the lasso one usually considers decreasing penalty values. 

In summary the AR procedure suggested in this paper is quite straight¬ 
forward to understand and implement, can be easily combined with iterative 
optimization procedures like Newton-Raphson, and offers efficient ways to 
compute entire solution surfaces. We hope that this paper could be a first 
step to learn more about the theoretical properties of this method, which 
definitely seems to be worth of further investigation. 
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